🕔 Time Series

CandleStick Graphs
Heatmap Graphs (over time)
Line Graphs
Time Series
Author

Arvind Venkatadri

Published

December 15, 2022

Modified

May 26, 2023

Abstract
Events, Trends, Seasons, and Changes over Time

Slides and Tutorials

Time Series Modelling and Forecasting

Setting up R Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)  # Deal with dates

library(mosaic)
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum
library(fpp3) # Rob Hyndman's textbook package
── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
✔ tsibble     1.1.3     ✔ fable       0.3.3
✔ tsibbledata 0.4.1     ✔ fabletools  0.3.3
✔ feasts      0.3.1     
── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ mosaic::count()      masks dplyr::count()
✖ mosaic::cross()      masks purrr::cross()
✖ lubridate::date()    masks base::date()
✖ mosaic::do()         masks dplyr::do()
✖ Matrix::expand()     masks tidyr::expand()
✖ dplyr::filter()      masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ fabletools::model()  masks mosaic::model()
✖ Matrix::pack()       masks tidyr::pack()
✖ tsibble::setdiff()   masks base::setdiff()
✖ mosaic::stat()       masks ggplot2::stat()
✖ mosaic::tally()      masks dplyr::tally()
✖ tsibble::union()     masks base::union()
✖ Matrix::unpack()     masks tidyr::unpack()
# Loads all the core timeseries packages, see messages

# devtools::install_github("FinYang/tsdl")
library(tsdl) # Time Series Data Library from Rob Hyndman

library(tsbox) # "new kid on the block"
library(TSstudio) # Easy Plots, Decompositions, and Modelling with Time Series

Attaching package: 'TSstudio'

The following object is masked from 'package:tsbox':

    ts_plot

Introduction

Any metric that is measured over regular time intervals forms a time series. Analysis of Time Series is commercially important because of industrial need and relevance, especially with respect to Forecasting (Weather data, sports scores, population growth figures, stock prices, demand, sales, supply…). For example, the graph below shows the temperatures over time in two US cities:

What can we do with Time Series? A time series can be broken down to its components so as to systematically understand, analyze, model and forecast it. As with other datasets, we have to begin by answering fundamental questions, such as:

  1. What are the types of time series?
  2. How do we visualize time series?
  3. How do we decompose the time series into level, trend, and seasonal components?
  4. How might we make a model of the underlying process that creates these time series?
  5. How do we make useful forecasts with the data we have?

We will first look at the multiple data formats for time series in R. Alongside we will look at the R packages that work with these formats and create graphs and measures using those objects. We will then look at obtaining the components of the time series and try our hand at modelling and forecasting.

Time Series Data Formats

There are multiple formats for time series data. The ones that we are likely to encounter most are

  • The tibble format: the simplest and most familiar data format is of course the standard tibble/dataframe, with a time column/variable to indicate that the other variables vary with time. The standard tibble object is used by many packages, e.g. timetk and modeltime.

  • The ts format: We may simply have a single series of measurements that are made over time, stored as a numerical vector. The stats::ts() function will convert a numeric vector into an R time series ts object, which is the most basic time series object in R. The base-R ts object is used by established packages such as forecast and is also supported by newer packages such as tsbox.

  • The tsibble format: this is a modern format for time series analysis. The special tsibble object (“time series tibble”) is used by fable, feasts and others from the tidyverts set of packages.

There are many other time-oriented data formats too (probably too many), such as tibbletime and TimeSeries objects. For now, the best way to deal with these, should you encounter them, is to convert them to a tibble or a tsibble (using, say, tsbox) and work with those.
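As a minimal sketch of such a conversion, assuming the tsbox and tibble packages are installed, the built-in AirPassengers ts object can be turned into a tibble and back again. The object names ap_tbl and ap_back are made up for this illustration.

```r
library(tsbox)

# ts -> tibble: one row per observation, with `time` and `value` columns
ap_tbl <- ts_tbl(AirPassengers)
head(ap_tbl)

# tibble -> ts: tsbox works out the monthly frequency from the `time` column
ap_back <- ts_ts(ap_tbl)
frequency(ap_back)
```

tsbox offers similar ts_*() converters for the other classes it supports, so the same pattern should carry over to those formats.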

Creating and Plotting Time Series

We will start with simple ts data, and then use data in tibble format that we can plot as is. After that we will look at a ready-made tsibble dataset, and finally convert our tibble to tsibble format to do more with it.

Base-R ts format data

There are a few datasets in base R that are in ts format already.

AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
str(AirPassengers)
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...

This can be easily plotted using base R and other more recent packages:

plot(AirPassengers) # Base R
tsbox::ts_plot(AirPassengers) # tsbox static plot
TSstudio::ts_plot(AirPassengers) # TSstudio interactive plot

One can see that there is an upward trend and also seasonal variations that also increase over time.
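The visual impression can be checked with a couple of quick summaries in base R. This is only a rough sketch, not a formal decomposition; the names annual_total and annual_spread are made up here.

```r
# Annual totals: a steadily rising sequence points to an upward trend
annual_total <- aggregate(AirPassengers, nfrequency = 1, FUN = sum)
annual_total

# Within-year spread (max - min): a growing spread suggests that the
# seasonal swings get larger over time
annual_spread <- aggregate(AirPassengers, nfrequency = 1,
                           FUN = function(x) max(x) - min(x))
annual_spread
```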

Let us take data that is “time oriented” but not in ts format. We use the ts() function to convert a numeric vector to ts format. The syntax of ts() is:

Syntax: objectName <- ts(data, start, end, frequency), where,

  • data : the numeric vector of observations
  • start : the time of the first observation
  • end : the time of the last observation
  • frequency : the number of observations per unit of time. For example, with a year as the unit, 1 = annual, 4 = quarterly, 12 = monthly; with a week as the unit, 7 = daily observations.
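To make these arguments concrete, here is a small sketch with invented numbers. The sales vector is hypothetical and only serves to show how start and frequency fix the time axis.

```r
# Hypothetical monthly sales figures, invented for this illustration
sales <- c(120, 135, 150, 160, 172, 181, 190, 205, 198, 187, 176, 210)

# Twelve monthly observations starting in January 2020
sales_ts <- ts(sales, start = c(2020, 1), frequency = 12)
sales_ts

start(sales_ts)      # time of the first observation: year, month
end(sales_ts)        # time of the last observation
frequency(sales_ts)  # observations per year
```

Since end is not given, ts() infers it from the length of the data.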

We will pick a simple dataset that is not a time series, ChickWeight, and extract a numeric vector of weights from it:

ChickWeight %>% head()
# Filter for Chick #1 and for Diet #1
ChickWeight_ts <- ChickWeight %>% 
  filter(Chick == 1, Diet == 1) %>% 
  select(weight, Time)

ChickWeight_ts <- stats::ts(ChickWeight_ts$weight, frequency = 2) 
str(ChickWeight_ts)
 Time-Series [1:12] from 1 to 6.5: 42 51 59 64 76 93 106 125 149 171 ...

Now we can plot this in many ways:

plot(ChickWeight_ts) # Using base-R
#ts_boxable(ChickWeight_ts)
tsbox::ts_plot(ChickWeight_ts,
               ylab = "Weight of Chick #1") # Using tsbox
TSstudio::ts_plot(ChickWeight_ts,
                  Xtitle = "Time", 
                  Ytitle = "Weight of Chick #1") # Using TSstudio

tibble data

Using the familiar tibble structure opens up new possibilities. We can have multiple time series within a tibble (think GDP, Population, Imports, Exports for multiple countries, as with the gapminder data we saw earlier; see https://www.gapminder.org/data/). It also allows for data processing with dplyr, such as filtering and summarizing.

    Let us read in and inspect the US births data from 2000 to 2014. Download this data by clicking on the icon below, and save the downloaded file in a sub-folder called data inside your project.

    Read this data in:

    births_2000_2014 <- read_csv("data/US_births_2000-2014_SSA.csv")
    births_2000_2014

    Plotting tibble time series

    We will now plot this using ggformula.

    With the separate year, month, date_of_month and day_of_week columns, we can plot births over time, colouring by day_of_week, for example:

    births_2000_2014 %>% 
    
      gf_line(births ~ year, 
              group = ~ day_of_week, 
              color = ~ day_of_week) %>% 
      
      gf_point() %>% 
      
      gf_theme(scale_colour_distiller(palette = "Paired")) %>% 
      gf_theme(theme_classic())

    Not particularly illuminating. This is because the data is daily and we have considerable variation over time.

    We should calculate the mean births on a monthly basis in each year and plot that:

    births_2000_2014 %>% 
             
    # Convert month to factor
      mutate(month = as_factor(month)) %>% 
      
      group_by(year, month) %>% 
      summarise(mean_monthly_births = mean(births, na.rm = TRUE)) %>% 
      
      gf_line(mean_monthly_births ~ year, 
              group = ~ month, 
              colour = ~month) %>% 
      gf_point() %>% 
      
      gf_theme(scale_colour_brewer(palette = "Paired")) %>% 
      gf_theme(theme_classic())

    So… average births per month were higher in 2005 to 2007 and have dropped since. We can do similar graphs using day_of_week as our basis for grouping, instead of month:

    births_2000_2014 %>% 
      mutate(
             # So that we can have discrete colours for each week day
             # Using base::factor()
             # Could use forcats::as_factor() also
             day_of_week = base::factor(day_of_week,
                                        levels = c(1,2,3,4,5,6,7), 
                                        labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>% 
      
      group_by(year, day_of_week) %>% 
      summarise(mean_weekly_births = mean(births, 
                                          na.rm = TRUE)) %>% 
      
      gf_line(mean_weekly_births ~ year, 
                 group = ~ day_of_week, 
                 colour = ~ day_of_week, data = .) %>% 
      gf_point() %>% 
      
      # palette for 12 colours
      gf_theme(scale_colour_brewer(palette = "Paired")) %>% 
    
      gf_theme(theme_classic())

    Looks like an interesting story here… there are noticeably fewer births on average on Sat and Sun, over the years! Why? Should we watch Grey’s Anatomy?

    So far we have simply treated the year/month/day variables as plain numerical variables. We have not created an explicit time or date variable. Let us do that now.

    There are several numerical variables: year, month, date_of_month, day_of_week, and of course the births on a daily basis. tsbox::ts_plot needs just a date column and the births column to plot with, so that it is not confused by the other numerical columns. Let us create a date column from year, month and date_of_month, but retain these columns for now. We use the lubridate package from the tidyverse:

    births_timeseries <- 
      births_2000_2014 %>% 
      
      mutate(date = lubridate::make_date(year = year,
                                         month = month,
                                         day = date_of_month)) %>% 
      
      select(date, births, year, month,date_of_month, day_of_week)
    
    births_timeseries
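Since the births file has to be downloaded separately, here is a self-contained sketch of the same step on three made-up rows. The toy_births values are invented; only the column names follow the births data. It also checks that the weekday computed by lubridate matches the Monday = 1 coding used for day_of_week earlier.

```r
library(dplyr)
library(lubridate)

# Three made-up rows that mimic the column layout of the births data
toy_births <- tibble::tibble(
  year = c(2000, 2000, 2001),
  month = c(1, 12, 2),
  date_of_month = c(1, 31, 28),
  births = c(9000, 7500, 12000)
)

toy_dates <- toy_births %>%
  mutate(
    date = make_date(year = year, month = month, day = date_of_month),
    # week_start = 1 numbers the days Monday = 1 ... Sunday = 7
    weekday = wday(date, week_start = 1)
  )
toy_dates
```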

    Plotting this directly:

    births_timeseries %>% 
      select(date, births) %>% 
      tsbox::ts_plot()

    births_timeseries %>% 
      select(date, births) %>% 
      TSstudio::ts_plot()

    If we need to set up average monthly and weekly births as before, we need to understand more about data processing with time series, similar to what dplyr does for tibbles. We will do this shortly, but using tsibble.

    tsibble data

    Finally, we have tsibble (“time series tibble”) format data, which contains three main components:

    • an index variable that defines time;
    • a set of key variables, usually categorical, that define sets of observations, over time. This allows for each combination of the categorical variables to define a separate time series.
    • a set of quantitative variables, that represent the quantities that vary over time (i.e. over the index)
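The three components can be seen in a tiny tsibble built from scratch. This is a sketch with made-up temperatures for two hypothetical cities; the column names month, city and temp are chosen only for this illustration.

```r
library(tsibble)

# Made-up monthly temperatures for two hypothetical cities
toy_tsibble <- tsibble(
  month = yearmonth(rep(as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
                        times = 2)),
  city = rep(c("City A", "City B"), each = 3),
  temp = c(4.1, 5.3, 9.8, 21.0, 22.4, 20.7),
  index = month,
  key = city
)
toy_tsibble

index_var(toy_tsibble)      # the time index
key_vars(toy_tsibble)       # the key(s) that identify each series
measured_vars(toy_tsibble)  # the quantities that vary over time
```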

    Here is Rob Hyndman’s video introducing tsibbles:

    The package tsibbledata contains several ready-made datasets in tsibble format. Let us try PBS, which is a dataset containing monthly Medicare prescription data in Australia.

    Run data(package = "tsibbledata") in your Console to find out about these.
    data("PBS")
    PBS

    This is a large-ish dataset:

    • 67K observations
    • 336 combinations of key variables (Concession, Type, ATC1, ATC2) which are categorical, as foreseen.
    • Data appears to be monthly, as indicated by the 1M.
    • the time index variable is called Month

    Note that there are multiple Quantitative variables (Scripts, Cost). A single univariate ts object cannot hold these side by side (base R would need a multivariate ts matrix for that), whereas a tsibble supports them directly. The Qualitative Variables are described below.

    Type help("PBS") in your Console.

    The data is dis-aggregated/grouped using four keys:

    • Concession: Concessional scripts are given to pensioners, unemployed, dependents, and other card holders
    • Type: Co-payments are made until an individual’s script expenditure hits a threshold ($290.00 for concession, $1141.80 otherwise). Safety net subsidies are provided to individuals exceeding this amount.
    • ATC1: Anatomical Therapeutic Chemical index (level 1). 15 types
    • ATC2: Anatomical Therapeutic Chemical index (level 2). 84 types, nested inside ATC1

    Let us simply plot Cost over time:

    PBS %>% 
      gf_point(Cost ~ Month, data = .) %>% 
      gf_line() %>% 
      gf_theme(theme_classic())

    This basic plot is quite messy.

    tsibble has dplyr-like functions

    We can use dplyr functions such as mutate(), filter(), select() and summarise() to work with tsibble objects. tsibble itself has no function for filtering on categorical variables; that needs to be done with dplyr.

    However, tsibble has specialized functions to do with the index (i.e time) variable and the key variables, things similar to what dplyr does.
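A small sketch of two such index-aware functions, filter_index() and index_by(), on a made-up daily series. The daily tsibble and its values are invented for illustration.

```r
library(dplyr)
library(tsibble)

# Made-up daily series covering January and February 2020 (60 days)
daily <- tsibble(
  date = as.Date("2020-01-01") + 0:59,
  value = 1:60,
  index = date
)

# filter_index() subsets rows by a time window on the index
january <- daily %>%
  filter_index("2020-01-01" ~ "2020-01-31")

# index_by() groups by a coarser time unit, much like group_by()
monthly <- daily %>%
  index_by(month = ~ yearmonth(.)) %>%
  summarise(mean_value = mean(value))
monthly
```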

    Let us first see how many observations there are for each combo of keys:

    PBS %>% 
      tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>% 
      dplyr::count()
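To see which key combinations have fewer observations than the rest, one option is to drop down to a plain tibble and sort the counts. This is a sketch that assumes the tsibbledata package is installed; obs_per_key and n_obs are names made up here.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)

# Number of monthly observations for each combination of keys
obs_per_key <- PBS %>%
  as_tibble() %>%
  dplyr::count(ATC1, ATC2, Concession, Type, name = "n_obs")

# Key combinations with the fewest observations come first
obs_per_key %>%
  arrange(n_obs) %>%
  head()
```

dplyr::count() is called with its namespace because mosaic, loaded earlier, masks count().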

    We have 336 combinations of Qualitative variables, each combo containing 204 observations (except some!): so let us filter for a few such combinations and plot:

    PBS %>% 
      tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>%
      gf_line(Cost ~ Month, 
              colour = ~ Type, 
              data = .) %>% 
      gf_point() %>% 
      gf_theme(theme_classic())
    # For a specific combo of Qual variables(keys)
    PBS %>% 
      dplyr::filter(Concession == "General", 
                          ATC1 == "A",
                          ATC2 == "A10") %>% 
      
      gf_line(Cost ~ Month, 
              colour = ~ Type, 
              data = .) %>% 
      gf_point() %>% 
      
      gf_theme(theme_classic())

    As can be seen, the time patterns are very different for the two Types of payment methods. Both are strongly seasonal, with seasonal variation increasing over the years, but there is a much stronger upward trend with the Co-payments method of payment.

    We can use tsibble’s dplyr-like commands to develop summaries by year, quarter, or month (the original data). Look carefully at the new time variable created each time:

    # Cost Summary by Year
    PBS %>% 
      tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>% 
      index_by(year(Month)) %>% 
      summarise(mean = mean(Cost, na.rm = TRUE))
    # Cost Summary by Quarter
    PBS %>% 
      tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>% 
      tsibble::index_by(yearquarter(Month)) %>% 
      dplyr::summarise(mean = mean(Cost, na.rm = TRUE))
    # Cost Summary by Month, which is the original data
    # Only grouping happens here
    PBS %>% 
      tsibble::group_by_key(ATC1, ATC2, Concession, Type) %>% 
      index_by() %>% 
      summarise(mean = mean(Cost, na.rm = TRUE))
    # Original Data
    PBS

    Finally, it may be a good idea to convert a tibble into a tsibble to leverage some of the functions that tsibble offers:

    births_tsibble <- births_2000_2014 %>% 
      
      mutate(date = lubridate::make_date(year = year,
                                         month = month,
                                         day = date_of_month)) %>%
      # Convert to tsibble
      tsibble::as_tsibble(index = date) # Time Variable
    
    births_tsibble

    This is DAILY data of course. Let us say we want to group by month and plot mean monthly births as before, but now using tsibble and the index variable:

    births_tsibble %>%
      gf_line(births ~ date, data = .) %>% 
      gf_theme(theme_classic())

    # Very busy plot
    # Try to group by month and take average as before
    # this time with tsibble
    # 
    births_tsibble %>% 
      tsibble::index_by(month_index = ~ tsibble::yearmonth(.)) %>% 
      
      # Monthly Birth Averages 
      dplyr::summarise(mean_births = mean(births, na.rm = TRUE)) %>% 
      
      gf_point(mean_births ~ month_index, data = .) %>% 
      gf_line() %>% 
      gf_smooth(se = FALSE, method = "loess") %>% 
      gf_theme(theme_minimal())

    Apart from the bump in 2006-2007, there are also seasonal patterns that repeat each year, which we glimpsed earlier.

    births_tsibble %>% 
      tsibble::index_by(year_index = ~ lubridate::year(.)) %>% 
    
      # Annual Birth Averages now
      dplyr::summarise(mean_births = mean(births, na.rm = TRUE)) %>%
      
      gf_point(mean_births ~ year_index, data = .) %>% 
      gf_line() %>% 
      gf_smooth(se = FALSE, method = "loess") %>% 
      gf_theme(theme_minimal())

    Ah yes…

    # Why not use dplyr::group_by() for tsibbles?
    
    births_tsibble %>% 
      dplyr::group_by(year) %>% 
    # This grouping does not give a proper result
    # The grouping by `index` is different
    # Annual Birth Average as before
      summarise(mean_births = mean(births, na.rm = TRUE)) 
    # Should give 15 rows but does not!
    # The original dataset does, however.
    
    births_tsibble %>% 
      tsibble::index_by(year) %>% 
      dplyr::summarise(mean_births = mean(births, na.rm = TRUE)) 
    # 15 rows, one for each year

    Candle-Stick Plots

    Hmm… can we try to plot boxplots over time (Candle-Stick Plots)? Over month / quarter or year?

    Monthly Box Plots

    # Monthly box plots
    births_tsibble %>%
      index_by(month_index = ~ yearmonth(.)) %>% 
      # 15 years
      # No need to summarise, since we want boxplots per year / month
      gf_boxplot(births ~ date, 
                 group =  ~ month_index, 
                 fill = ~ month_index, data = .) %>%  
      # plot the groups
      # 180 plots!!
      gf_theme(theme_minimal())

    Quarterly boxplots

    births_tsibble %>%
      index_by(qrtr_index = ~ yearquarter(.)) %>% # 60 quarters over 15 years
      # No need to summarise, since we want boxplots per year / month
      gf_boxplot(births ~ date, 
                 group = ~ qrtr_index,
                 fill = ~ qrtr_index,
                 data = .) %>%  # 60 plots!!
      gf_theme(theme_minimal())

    Yearwise boxplots

    births_tsibble %>% 
      index_by(year_index = ~ lubridate::year(.)) %>% # 15 years, 15 groups
        # No need to summarise, since we want boxplots per year / month
    
      gf_boxplot(births ~ date, 
                  group = ~ year_index, 
                  fill = ~ year_index, 
                 data = .) %>%  # plot the groups 15 plots
      gf_theme(scale_fill_distiller(palette = "Spectral")) %>% 
      gf_theme(theme_minimal())

    Although the graphs are very busy, they do reveal seasonal patterns at different time scales.
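A less busy variant pools all years and draws one box per calendar month. The sketch below uses the built-in AirPassengers series instead of the births data, simply because it ships with R; the name month_of_year is made up here.

```r
# cycle() gives the position of each observation within the year (1 = Jan)
month_of_year <- cycle(AirPassengers)

# One box per calendar month, pooling all years together
bp <- boxplot(AirPassengers ~ month_of_year,
              names = month.abb,
              xlab = "Month", ylab = "Passengers")
```

With only twelve boxes, the months that sit consistently higher or lower than the rest are easier to pick out.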

    Conclusion

    We have seen a good few data formats for time series, and how to work with them and plot them. We have also seen how to decompose time series into periodic and aperiodic components, which can be used to make business decisions.

    In the tutorial listed under Slides and Tutorials above, we will explore modelling and forecasting of time series.

    Your Turn

    1. Choose some of the datasets in the tsdl and in the tsibbledata packages. Plot basic, filtered and model-based graphs for these and interpret.

    References

    1. Rob J Hyndman and George Athanasopoulos, Forecasting: Principles and Practice (Third Edition). Available online.

    2. Time Series Analysis at Our Coding Club

    Readings

    1. The Nuclear Threat—The Shadow Peace, part 1

    2. 11 Ways to Visualize Changes Over Time – A Guide

    3. What is seasonal adjustment and why is it used?

    4. The start-at-zero rule

    5. Keeping one’s appetite after touring the sausage factory

    6. How Common is Your Birthday? This Visualization Might Surprise You

    7. The Fallen of World War II

    8. Visualizing Statistical Mix Effects and Simpson’s Paradox

    9. How To Fix a Toilet (And Other Things We Couldn’t Do Without Search)